ORM: Use pydantic to specify a schema for each ORM entity #6255

sphuber · 2024-01-18T21:34:41Z

See corresponding AEP for details.

danielhollas · 2024-04-27T13:39:31Z

@sphuber just wondering, have you checked how this affects performance?
Does it influences import aiida.orm at all?

This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255

agoscinski · 2024-05-15T12:21:11Z

src/aiida/orm/nodes/data/code/installed.py

+        def serialize_computer(self, computer: Computer, _info):
+            return computer.label
+
+        # @model_validator(mode='after')  # type: ignore[misc]


Seems like some code that was forgotten

Thanks for checking out this PR @agoscinski . For some background info: I commented this out on purpose for now. Ideally, this check should be there because when storing the code, this condition should be satisfied. Unfortunately, however, this now also prevents just instantiating this model that validates the check, even if we don't plan on storing it. This is problematic for the export behavior, because that would rely on instantiating a Model that violates this uniqueness rules, even though in this case it doesn't apply.

The old CLI verdi code create core.code.installed would also rely on something like this check to immediately return an error as soon as label was specified in interactive mode and that label already existed for the same computer. It would then reprompt the user asking for a different label. I haven't been able to reimplement this feature in this PR so was keeping the comment as a reminder.

Safe to say that this PR is not completely done yet 😅

Is this still "in progress"?

@sphuber I'm looking at this now. Trying to probe the issue you mention in this thread. I did the following:

Create a hello_world@localhost installed code using verdi

Load the code and serialize it

Create an InstalledCode instance from 2
I expected an issue here, as from_serialized leverages the model, so the validator should've been triggered, raising an exception, as the label + computer already exist! No exception though.

As for your comment regarding the CLI, are you saying a similar validation should be implemented there? Sounds like it, but not in this PR, right?

Summarizing a recent meeting with @sphuber:

The currently commented out validator on the Code.Model is intended as a check on the uniqueness of Code label+computer.label tuples

However, one use of the pydantic Model of a code, the dynamic creation of the interactive options of verdi code create..., requires the Model only for access to the available properties. Unfortunately, the validator will fire on instantiation from the serialized code, which will conflict with the uniqueness constraint, making it impossible to use the Model in this case

The move to pydantic Models for this CLI application regresses the user experience in the sense that it does not fire the validator until after the user finishes the interactive part, even though the tuple is already available within the first three options (note that this is not tied to this PR, as the use of pydantic Models for this case was implemented in an earlier PR)

The CLI matter is not blocking for this PR, as I understand it. A solution can be explored in a follow-up PR.

@sphuber one thing that is still unclear to us is how in the CLI application a Model is created from an existing Code. Are we not discussing the creation of codes? How is there already a code?

It doesn't create a Model from an existing code. What the CLI code does is directly construct a Model from the values passed by the user to the CLI command. And then uses InstalledCode.from_model(model) to instantiate the InstalledCode instance which it then will store. The relevant line in the code:
https://github.com/sphuber/aiida-core/blob/d4c4d94851836e58e743d8f1a2d30c665af2ae3b/src/aiida/cmdline/commands/cmd_code.py#L35

So in that particular use case, the CLI implementation for creating codes, the fact that instantiating InstalledCode.Model for a label and computer that already exist raises a validation error is actually exactly what we want. Because we intend to then store an actual InstalledCode instance from that model directly after. The problem only comes in that now we can never instantiate this Model even if we never intend to store it as another ORM instance. The main example is when we want to serialize an existing ORM instance through the Model as an intermediate, as we have discussed in the comments above.

Thanks @sphuber. Two things:

Regarding unintended effects of the serialize_computer guard, I would argue that any situation prior to the future merging of this PR that required the label of the computer was a situation in which the computer was indeed an instance of Computer and thus would not be affected by this change. The only cases where computer is a str should be the serialization case, no?

Regarding the CLI case...

The problem only comes in that now we can never instantiate this Model even if we never intend to store it as another ORM instance. The main example is when we want to serialize an existing ORM instance through the Model as an intermediate, as we have discussed in the comments above.
Are you saying there IS a problem still, or are you simply reiterating the original problem? Because it seems to me that the use of model_construct is precisely the solution. Or am I missing something?

Regarding unintended effects of the serialize_computer guard, I would argue that any situation prior to the future merging of this PR that required the label of the computer was a situation in which the computer was indeed an instance of Computer and thus would not be affected by this change. The only cases where computer is a str should be the serialization case, no?

I am not worried about existing code, since the serialization is completely new. But I was just saying that with adding this workaround, although it fixes the immediate problem with using model_construct (which is itself already a bit of a workaround), I cannot guarantee that this will not lead to other problems/inconsistencies in the future. But perhaps this is still the best solution for now.

Are you saying there IS a problem still, or are you simply reiterating the original problem? Because it seems to me that the use of model_construct is precisely the solution. Or am I missing something?

No I was simply answering your question and providing more explanation. Using model_construct in the serialization code does seem to workaround the problem for now (in combination with the guard in the computer serializer)

Okay. Thanks for clarifying this matter 🙏

Hmm, I think maybe there's another potentially critical issue here.

I tried reconstructing an installed code instance. With the validation active on the model, I first need to make the above change (and related changes) to Entity.to_model(). This allows me to serialize the code. However, deserializing it using orm.InstalledCode.from_serialized(**code.serialize()) also raises the validation error. But here we shouldn't assume pre-validated input, so we can't use the model_construct solution in from_serialized.

I'm not sure how to get around this matter and allow the model validator. A REST API receiving a a serialized Code instance cannot assume the data is valid. In the case that the data came from the same AiiDA instance with which the REST API is linked, the validation will surely object.

@sphuber am I overlooking something?

update

Thinking on this some more, "real" use cases of REST API node transfer will likely be between AiiDA instances, so the collision of the same full code label should not happen? But even if this is true, a specialized warning would be better. Not sure where to implement it though.

unkcpz · 2024-01-20T22:49:57Z

pyproject.toml

@@ -132,6 +132,17 @@ requires-python = '>=3.9'
 'process.workflow.workchain' = 'aiida.orm.nodes.process.workflow.workchain:WorkChainNode'
 'process.workflow.workfunction' = 'aiida.orm.nodes.process.workflow.workfunction:WorkFunctionNode'

+[project.entry-points.'aiida.orm']


Is it really necessary to make orm pluggable, what is the use case of it?

@sphuber correct me if I'm wrong, but this has to do with the use of orm_class, right? Under the implementation section of the AEP, it states that orm_class is a shortcut to model_to_orm used to define a model field that points to an ORM instance. So instead of

class Model(...): user = MetadataField( ..., model_to_orm=lambda pk: User.collection.get(id=pk), )

you can do

class Model(...): user = MetadataField( ..., orm_class=User, )

The various ORM classes that may be used in orm_class are then defined in project.entry-points.'aiida.orm', so that you can do in Entity.model_to_orm_field_values(...):

if orm_class := get_metadata(field, 'orm_class'): if isinstance(orm_class, str): orm_class = BaseFactory('aiida.orm', orm_class) fields[key] = orm_class.collection.get(id=field_value)

But then I guess my question is, do these not already have entry points? Or is the problem that you can't fetch them all generically using the BaseFactory?

codecov · 2024-07-08T07:57:27Z

Codecov Report

Attention: Patch coverage is 94.48276% with 24 lines in your changes missing coverage. Please review.

Project coverage is 77.97%. Comparing base (ef60b66) to head (d4c4d94).
Report is 110 commits behind head on main.

Files	Patch %	Lines
src/aiida/orm/entities.py	89.10%	6 Missing ⚠️
src/aiida/orm/comments.py	70.59%	5 Missing ⚠️
src/aiida/orm/authinfos.py	86.96%	3 Missing ⚠️
src/aiida/orm/fields.py	81.25%	3 Missing ⚠️
src/aiida/common/pydantic.py	80.00%	2 Missing ⚠️
src/aiida/orm/nodes/node.py	93.11%	2 Missing ⚠️
src/aiida/cmdline/groups/dynamic.py	87.50%	1 Missing ⚠️
src/aiida/orm/groups.py	94.12%	1 Missing ⚠️
src/aiida/orm/nodes/data/singlefile.py	93.34%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6255      +/-   ##
==========================================
+ Coverage   77.51%   77.97%   +0.47%     
==========================================
  Files         560      566       +6     
  Lines       41444    42279     +835     
==========================================
+ Hits        32120    32964     +844     
+ Misses       9324     9315       -9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

khsrali · 2024-07-11T12:44:56Z

It would be nice to have some benchmarks,
Just to see if these changes would slow down AiiDA and if, how much. So to evaluate pros and cons.

This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255

* Introducing the kind checks. In principle, we can also not define the kinds, in that case it is up to the plugin to use the get_kinds() method of the StructureData to define kinds. * Starting to migrate to mkdocs * StructureData developer guide, missing: PropertyCollector and properties. * First profiling for the atomistic vs orm StructureData generation. Need more profiling for low-atoms systems, where it seems orm is better * Comment on profiling * Additional notes on profilings * Adding a last scaling test, to be commented and inserted in the github issue * Adding the automatic kind generation. Missing: deactivate this automatism. * Adding the `allow_kinds` input in the StructureData init. used to deactivate the automatic kind generation. * Adding the `to_legacy_structuredata` method in atomistc.StructureData This is provided in temporary way in order to make easier to migrate the plugin gradually to this new Data. In a first step, we should use this method anytime the plugin uses the StructureData. Then add the properties setting. * - Adding the developer docs - Adding some bugfixing for the empty StructureData init. * Added user guide with examples * Fixing #16 * This solves issue #10 This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255 * New implementation, following the pydantic PR. adding the `charges` properties, in the same way as we have the cell and pbc: hard-coding the getter and setter methods. For now, no data validation is there. * Added the support to read charges also from ASE Atoms object. Added the map_kinds method, which return the mapping for the kinds as contained in the sites attribute. * Reverting back to non-pydantic implementation (Just removing the Model attribute from the StructureData class, nothing else changed) Added also the charges in the get_ase method. * Attaching the properties to the Sites. - code refactored: now `StructureData` and `Site` classes are in separated files, as well as the utils; `Kind` class is removed - `Site` class now contains the properties: `symbol`, `position`, `mass`, `charge`, `magnetization`, `kind_name`; `kind_name` should be removed. The properties are defined via the `property` decorator - `from_ase`, `to_ase` methods are added - `from_pymatgen`, `to_pymatgen` methods are added (**TOBE fixed to add also the magnetization**) - `from_file`, `to_file` methods are added. They rely on ASE `io.read` and `io.write` functions. - the `orm.StructureData` (the old one) now has a method `to_atomistic` - the `utils.py` file contains also the `to_kinds` and `get_kinds` functions, to automatically generate kinds. The user should only use the `get_kinds` function. **TOBE refined: kind names, and remove the other kinds methods from the StructureData class.** - the `StructureData` constructor now works in this way: we provide the pbc, cell. Then we should use the `append_atom` method afterwards. **TOBE discussed, I guess we cann add lists contaning site-wise value for properties** - slicing of structure data: defined the __getitem__ method, to obtain a sliced structuredata instance from the initial one. - TOBE added: `to_supercell` method. * adding the possiblity to provide lists in the constructor BUT we are gonna change this by providing the possibility to provide list of sites (each of them is a dictionary), in the same format as they are represented in the database * Moving the get_kinds and _to_kinds (now a private method) into the StructureData class. * Removed kind related methods which now are no more needed in the StructureData * Utilities for cell and sites update * New StructureData and StructureDataMutable both inheriting from StructureDataCore. Added properties: - charge - magmom Relevant added methods: - from/to ASE/Pymatgen - to_legacy - to_structuredata - to_mutable_structuredata Added single page for API tutorial. Added tests for basic functionalities of both Structure Classes. * Allowing also magmom to be provided as floats: they are then stored as [magmom,0,0] * Improving mutability and minor changes. Mutability is improved by removing all the setter methods, both in StructureDataCore and Site classes. Moreover, lists and tuples are returned as np.array with flags.writeable=False, meaning that we cannot even modify the internal arrays/lists. Other minor changes: - Now in StructureDataCore we have to define the pbc, cell and sites keywords to initialise it (and so the other subclasses), this is helpful to then document the inputs. - `get_kinds` routine fixed to work also with magmoms (arrays) - docu and tests updated with respect to these changes. * Adding type hinting and minor fixes removing mutable attribute in Site class adding a get_global_properties method in StructureDataCore, not used now but in the future to have a list of the properties in the database (not only attached to the single sites), to make easy to query wrt a property. * Improving flexibility, automatic get_kinds and pymatgen fixings. Providing full flexibility in the StructureDataMutable while preserving same data structure of StructureData. - adding the parent and index when we initialise the Site class (i.e. when we access structure.sites). This is helpful to be able to change directly properties in the StructureDataMutable class. - in the same direction, we implemented the ObservedArray class which basically allows to trigger the setter methods of arrays/lists even if we modify only one element (for example: structure.sites[0].position[0] = 5). In this way, changes are written into the structure._data, which represents the real data stored in the instance. `get_kinds` method now is stable. Fixing pymatgen problems with oxi_state and magmom. * Adding in the tutorial markdown also the get_kinds method, and minimal update of the pytests.

src/aiida/storage/psql_dos/orm/querybuilder/main.py

src/aiida/storage/psql_dos/orm/authinfos.py

src/aiida/plugins/entry_point.py

edan-bainglass · 2024-08-21T17:01:13Z

src/aiida/orm/utils/node.py

-    prefixes = ('aiida.orm.nodes.',)
+    prefixes = ('aiida.orm.nodes.', 'aiida.orm.core.')


This is related to the changes of adding the aiida.orm entry point group and making it possible to load the ORM classes through an entry point. But I will have to go back to the PR and actually test to see if this is still necessary

src/aiida/orm/utils/mixins.py

src/aiida/orm/nodes/repository.py

src/aiida/orm/nodes/node.py

edan-bainglass · 2024-08-21T17:49:09Z

src/aiida/orm/nodes/data/upf.py

@@ -375,7 +377,7 @@ def set_file(self, file, filename=None):
        from aiida.common.exceptions import ParsingError
        from aiida.common.files import md5_file, md5_from_filelike

-        parsed_data = parse_upf(file)
+        parsed_data = parse_upf(file, check_filename=self.CHECK_FILENAME)


@sphuber do you remember why you added this CHECK_FILENAME class attribute?

src/aiida/orm/nodes/data/singlefile.py

sphuber · 2024-08-21T20:59:40Z

Thanks for the review @edan-bainglass . I see that most your comments are questions. This is fair enough the PR doesn't provide any description of the changes and the reasoning behind it. Since most of your questions all refer to the same concept and so would be answered by the same answer, instead of going through them one by one, I will answer it in general here.

The whole point of this PR is to provide a way that any ORM class can automatically be serialized to JSON and the same instance can be recovered from that serialized JSON payload. This is implemented in the aiida.orm.entities.Entity class, which is the base class for all ORM classes. It provides the serialize and from_serialized methods for this purpose.

Instead of manually implementing JSON (de)serializing for all classes, we use pydantic. Each ORM class has to define a pydantic.BaseModel that captures their "schema". This is the part that replaces the QbFields that you added. The associated functionality is kept but the logic now parses the pydantic model instead of the custom QbField syntax.

For all of this to work, we need to be able to clean map an instance of a pydantic.BaseModel to an instance of the associated ORM class and vice versa. This essentially means that the latter's constructor needs to accept all the fields defined by the model and it needs to provide properties to retrieve the value for each field from an ORM instance. This is necessary because the implementation of from_model and to_model in the Entity class depends on it. I could have chosen to just implement this manually on each class and use the existing properties or go directly through the attributes (in the case of Node). But I want it to be automatic, because then plugin developers don't have to worry about it but they get it out of the box (since their plugin will have Entity as base class, and there it is automated). This is why you see all these changes that add arguments/logic to the constructor and add properties to retrieve field values.

I have been thinking whether those changes should be in separate commits. If we take the example of the AuthInfo, one of its fields is metadata but the constructor currently doesn't allow setting it. Same goes for extras of the Node. But why would that not be allowed? It is arbitrary to allow some fields to be defined, but require others to be set after the instance is created through some other setters. If you think this would help, I guess I could still refactor the PR to factor out those changes into an initial commit.

The changes to the entry point handling, and adding aiida.orm as a entry point group, essentially is to make it possible to actually fully serialize all ORM classes. Certain ORM classes already have a string identifier in the form of an entry point, namely the Data class and all its subclasses, i.e. data plugins. They have an entry point in aiida.data. But a class like Computer does not. If you need to communicate to what class a specific JSON serialized instance belongs, you could just use 'Computer' as a string, but it would be nicer and more consistent if it would be communicated through an entry point like other classes, e.g. aiida.orm:core.computer. This again will make implementing a REST API that can (de)serialize ORM instances out of the box trivial.

edan-bainglass · 2024-08-22T04:42:34Z

@sphuber first, thank you so much for the detailed explanation of the PR. The scope/depth of change + uncharacteristic lack of explanation was a recipe for confusion. Note that at the time of you writing the above, I was only "half"-way through my review, as I had started it quite late in the day. I'm wrapping up the review at the moment, but with the above clarifications, the PR is now mostly crystal clear! And I think you can skip PR refactoring. It all makes sense now as a whole.

Thanks for implementing this! It is pertinent to my current use case (query builder web app), which requires JSON (de)serialization of nodes via REST API calls. I believe you had mentioned before when we worked on the fields PR that this would precisely be where these changes would make an impact. Looking forward to this dropping!

edan-bainglass

@sphuber thanks again for implementing this. As mentioned in my other comment, it is mostly clear. Left a few comments after your post that I think would still benefit from clarification. Other than that, LGTM!

src/aiida/common/pydantic.py

src/aiida/orm/comments.py

src/aiida/orm/authinfos.py

src/aiida/cmdline/groups/dynamic.py

sphuber · 2024-08-22T08:40:49Z

@edan-bainglass one reason why I hadn't (yet) written an extensive documentation in the commits here, is because I actually put this in an AEP. There you will find an extensive description of the goal, design, implementation and considerations. I also provide examples of how these changes make it then trivial to write a REST API around it. I provide a working example with FastAPI.

As I mentioned in a comment elsewhere, I will quickly revisit this and see if the changes to the plugin loading system are still required for the REST API use case. If not, I will revert those. I would encourage yourself to also start testing this PR out with your envisioned use case to see what works and what doesn't and what might be missing.

edan-bainglass · 2024-11-27T09:15:13Z

It would be nice to have some benchmarks, Just to see if these changes would slow down AiiDA and if, how much. So to evaluate pros and cons.

@khsrali you mentioned you had some test cases. Could you provide these benchmarks against your scripts? 🙏

khsrali · 2024-11-28T10:53:09Z

It would be nice to have some benchmarks, Just to see if these changes would slow down AiiDA and if, how much. So to evaluate pros and cons.

@khsrali you mentioned you had some test cases. Could you provide these benchmarks against your scripts? 🙏

To be more precise, I have a scriptic way to run performance tests for aiida-core in general, but I don't have performance tests ready-to-go for this PR. I need some times to be done with some other projects, then will do some thinking to design relevant tests.

edan-bainglass · 2024-12-03T06:46:54Z

@sphuber btw, if you recall, we had tested the possibility of avoiding having to explicitly extend the Models of the parents of an ORM class in its own Model, i.e. just class Model: instead of, for example, class Model(Data.Model), etc. I think I may have resolved this matter. I can say more about the solution, but I'd like to get your opinion on it once more. I am slightly torn between the pros and cons of implicit inheritance. The downside is implicit inheritance, i.e., less clear? The upside, other than the obvious reduced burden on developers, is the assurance that the behaviors and attributes of all parent(s) Models are included, i.e., no risk of missing inheritance due to developer error. Thoughts?

edan-bainglass · 2024-12-03T07:10:05Z

@sphuber I did a small test on an old environment where I had a single python Code. Serialization of this code node yielded None for prepend_text, which in turn failed the validation when trying to create an InstalledCode using from_serialized, as prepend_text is expecting a string. Indeed, when I create a new code without prepend text, I get for the field an empty string, not None. I'm not sure how the old code ended up with None for the field. I tried creating codes explicitly and via CLI arguments, but I always end up with an empty string for the field, not None. Do you happen to know how a Code might end up with None for prepend_text? I suspected the default was changed at some point in the past, but I failed to track it down.

sphuber force-pushed the feature/orm-pydantic-models branch 3 times, most recently from 3848afd to c253259 Compare January 22, 2024 12:53

sphuber force-pushed the feature/orm-pydantic-models branch from c253259 to fc52f41 Compare January 30, 2024 16:11

danielhollas mentioned this pull request Feb 4, 2024

Lazy pydantic import #6275

Merged

edan-bainglass mentioned this pull request Feb 6, 2024

✨ NEW: Add orm.Entity.fields interface for QueryBuilder (cont.) #6245

Merged

sphuber mentioned this pull request Feb 28, 2024

Add CLI command to dump inputs/outputs of CalcJob/WorkChain #6276

Merged

sphuber force-pushed the feature/orm-pydantic-models branch 2 times, most recently from 49379d9 to d1659b4 Compare March 12, 2024 07:00

sphuber force-pushed the feature/orm-pydantic-models branch 3 times, most recently from a7a6065 to 069bd1f Compare March 20, 2024 16:45

sphuber mentioned this pull request Mar 20, 2024

Next-gen AiiDA provenance explorer materialscloud-org/issues#33

Open

sphuber force-pushed the feature/orm-pydantic-models branch 2 times, most recently from 655fcf0 to d778da8 Compare March 25, 2024 12:35

sphuber force-pushed the feature/orm-pydantic-models branch from 07f3001 to 054c43c Compare March 27, 2024 21:55

sphuber force-pushed the feature/orm-pydantic-models branch from 054c43c to 1c65856 Compare April 19, 2024 12:15

sphuber mentioned this pull request Apr 19, 2024

Thoughts on JSON serialisation and pydantic? aiidateam/aiida-atomistic#11

Open

sphuber force-pushed the feature/orm-pydantic-models branch from 7c7ef64 to 1e7f536 Compare April 19, 2024 12:58

mikibonacci added a commit to mikibonacci/aiida-atomistic that referenced this pull request May 8, 2024

This solves issue aiidateam#10

5dcd3c9

This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255

agoscinski reviewed May 15, 2024

View reviewed changes

mikibonacci mentioned this pull request May 30, 2024

Immutability and usability aiidateam/aiida-atomistic#9

Closed

sphuber force-pushed the feature/orm-pydantic-models branch from 1e7f536 to 66f6096 Compare July 6, 2024 11:25

unkcpz reviewed Jul 7, 2024

View reviewed changes

sphuber force-pushed the feature/orm-pydantic-models branch from a7c6c13 to 9816200 Compare July 8, 2024 07:27

sphuber mentioned this pull request Jul 8, 2024

👌 CLI: Computer/Code export output_file optional #6486

Merged

agoscinski pushed a commit to aiidateam/aiida-atomistic that referenced this pull request Jul 17, 2024

This solves issue #10

7cb2610

This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255

sphuber force-pushed the feature/orm-pydantic-models branch from 9816200 to d4c4d94 Compare August 9, 2024 07:56